Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

extend genes in complete genomes to start/stop codons #5

Merged
merged 16 commits into from
Oct 6, 2021

Conversation

ninewise
Copy link

Fixes #3

@ninewise ninewise marked this pull request as draft September 16, 2021 20:37
@ninewise ninewise force-pushed the fix/complete-start branch 2 times, most recently from bdb2aa0 to ec569a9 Compare September 16, 2021 21:15
@ninewise
Copy link
Author

It would seem a simple T replaced by an A can have consequences.

The main difference remaining between the FGSrs and Prodigal output is now the exclusion of the reverse stop codon in the reverse strands by FGSrs, where Prodigal includes this. Which would be the expected behaviour?

@ninewise ninewise marked this pull request as ready for review September 17, 2021 10:01
@pdawyndt
Copy link
Member

pdawyndt commented Sep 17, 2021

For me, the expected behaviour is uniform treatment of stop codons for proteins on forward and reverse strands.

@jianshu93
Copy link

I agree with this. Now FGSrs is inconsistent for + and - in terms of include stop/start codon or not compare to prodigal

Jianshu

@jianshu93
Copy link

a real world application using gene prediction: mapping transriptomic reads(pair ends sequencing)to each gene to see whether there are reads mapped to antisense(reverse strand of gene strand)showed that for reverse reads,the proportion of them mapped to the reverse strand of gene strand is a lot higher than forward reads for FGSrs but ver close for prodigal. So I would think now FGSrs reverse gene position is not so perfect.

Many thanks!

Jianshu

@jianshu93
Copy link

NTO_MAG_1_prodigal.new.gff.zip
NTO_MAG_1_gene.gff.zip

Hello, I tried the newest fix 02cc0fe. Now start and stop codons look good. But for some genes that are not consistent, FGSrs is always short than prodigal (large start position, same stop position). Others looks good.

Jianshu

Felix Van der Jeugt added 9 commits September 21, 2021 14:02
tool                   TP        FP        TN        FN      prec      sens      spec       NPV       MCC
FGS               2494195    231257    656879    283010     91.51     89.81     73.96     69.89     62.58
FGS+              2661256    533556    361341    109188      83.3     96.06     40.38     76.79     46.79
prodigal          2488137    117535    745673    313996     95.49     88.79     86.38     70.37     70.36
FGSrs             1917354    383036    508737    856214     83.35     69.13     57.05     37.27     23.23
tool                   TP        FP        TN        FN      prec      sens      spec       NPV       MCC
FGS               2494195    231257    656879    283010     91.51     89.81     73.96     69.89     62.58
FGS+              2661256    533556    361341    109188      83.3     96.06     40.38     76.79     46.79
prodigal          2488137    117535    745673    313996     95.49     88.79     86.38     70.37     70.36
FGSrs             1917354    383036    508737    856214     83.35     69.13     57.05     37.27     23.23
tool                   TP        FP        TN        FN      prec      sens      spec       NPV       MCC
FGS               2494195    231257    656879    283010     91.51     89.81     73.96     69.89     62.58
FGS+              2661256    533556    361341    109188      83.3     96.06     40.38     76.79     46.79
prodigal          2488137    117535    745673    313996     95.49     88.79     86.38     70.37     70.36
FGSrs             2601595    527580    367165    169001     83.14      93.9     41.04     68.48     42.47
tool                   TP        FP        TN        FN      prec      sens      spec       NPV       MCC
FGS               2494195    231257    656879    283010     91.51     89.81     73.96     69.89     62.58
FGS+              2661256    533556    361341    109188      83.3     96.06     40.38     76.79     46.79
prodigal          2488137    117535    745673    313996     95.49     88.79     86.38     70.37     70.36
FGSrs             2650006    517695    371428    126212     83.66     95.45     41.77     74.64     46.59
tool                   TP        FP        TN        FN      prec      sens      spec       NPV       MCC
FGS               2494195    231257    656879    283010     91.51     89.81     73.96     69.89     62.58
FGS+              2661256    533556    361341    109188      83.3     96.06     40.38     76.79     46.79
prodigal          2488137    117535    745673    313996     95.49     88.79     86.38     70.37     70.36
FGSrs             2652412    517860    371290    123779     83.67     95.54     41.76      75.0     46.78
Ratings:

 tool           TP      FP      TN      FN   prec   sens   spec    NPV    MCC
 FGS       2494195  231257  656879  283010  91.51  89.81  73.96  69.89  62.58
 FGS+      2661256  533556  361341  109188   83.3  96.06  40.38  76.79  46.79
 prodigal  2488137  117535  745673  313996  95.49  88.79  86.38  70.37  70.36
 FGSrs     2661082  518282  368500  117477   83.7  95.77  41.55  75.83  47.14
tool                   TP        FP        TN        FN      prec      sens      spec       NPV       MCC
FGS               2494195    231257    656879    283010     91.51     89.81     73.96     69.89     62.58
FGS+              2661256    533556    361341    109188      83.3     96.06     40.38     76.79     46.79
prodigal          2488137    117535    745673    313996     95.49     88.79     86.38     70.37     70.36
FGSrs             2661082    518282    368500    117477      83.7     95.77     41.55     75.83     47.14
tool                   TP        FP        TN        FN      prec      sens      spec       NPV       MCC
FGS               2494195    231257    656879    283010     91.51     89.81     73.96     69.89     62.58
FGS+              2661256    533556    361341    109188      83.3     96.06     40.38     76.79     46.79
prodigal          2488137    117535    745673    313996     95.49     88.79     86.38     70.37     70.36
FGSrs             2661082    518282    368500    117477      83.7     95.77     41.55     75.83     47.14
tool                    TP         FP         TN         FN       prec      sens      spec       NPV       MCC
FGS                2494195     231257     656879     283010      91.51     89.81     73.96     69.89     62.58
FGS+               2661256     533556     361341     109188       83.3     96.06     40.38     76.79     46.79
prodigal           2488137     117535     745673     313996      95.49     88.79     86.38     70.37     70.36
FGSrs              2504714     280484     607613     272530      89.93     90.19     68.42     69.04     58.78
Felix Van der Jeugt added 2 commits September 22, 2021 17:55
tool                    TP         FP         TN         FN       prec      sens      spec       NPV       MCC
FGS                2494195     231257     656879     283010      91.51     89.81     73.96     69.89     62.58
FGS+               2661256     533556     361341     109188       83.3     96.06     40.38     76.79     46.79
prodigal           2488137     117535     745673     313996      95.49     88.79     86.38     70.37     70.36
FGSrs              2494195     231257     656879     283010      91.51     89.81     73.96     69.89     62.58
@jianshu93
Copy link

Hello This time it improves by a lot. The ffn and faa output positions in the head is consistent with gff output but stop and start codon still not included in ffn and faa sequence.

Jianshu

@jianshu93
Copy link

jianshu93 commented Sep 22, 2021

More than that. The ffn and faa sequences (head is correct) output is not consistent with the prodigal fna and faa output even though the gff output is the same with prodigal gff output (output ffn part not fixed according to new gene positions). Everything else looks good. Many thanks for all those fixes and I think this should be the last one then FGSrs should be almost consistent with prodigal. Thanks again and good luck with the manuscript! You could try bmc bioinformatics.

If you would like any testing and comparisons with prodigal/metaGeneMark,I will be very happy to test and by the way Mark Borodosky is my co-advisor here at Georgia tech. Any thing related to model improvement, he may be the one to give some advice.
Many thanks,
Jianshu.

@ninewise
Copy link
Author

ninewise commented Sep 23, 2021

And that last commit should finish the PR. You might have noted the scripts in the meta/evaluation directory. These have helped me to check the quality of the predictions (given a complete genome with annotations from ENA). The rates script reports below table (true positives the number of bp in a prediction on the correct strand, false positives the number of bp in a prediction outside annotations or on the wrong strand, true negatives the number of bp outside annotations which weren't in any prediction and false negatives the number of bp inside annotations which weren't in any prediction; and prec(ision), sens(itivity), spec(ificity), Negative Predictive Value and Matthew's Correlation Coefficient in percentages).

tool TP FP TN FN prec sens spec NPV MCC
FGS 2494195 231257 656879 283010 91.51 89.81 73.96 69.89 62.58
FGS+ 2661256 533556 361341 109188 83.3 96.06 40.38 76.79 46.79
prodigal 2488137 117535 745673 313996 95.49 88.79 86.38 70.37 70.36
FGSrs 2494195 231257 656879 283010 91.51 89.81 73.96 69.89 62.58

Seems like in quality, prodigal still outperforms FragGeneScan(Rs). On the other hand, FGSrs is 37.3 times faster for the used dataset (see meta/evaluation/README.md).

We weren't really working on model improvement for now, just speed and code quality while replicating FGS behaviour, but this might be interesting in the future.

@ninewise ninewise requested a review from pdawyndt September 23, 2021 13:42
@ninewise
Copy link
Author

@jianshu93 I'd be interested in a final review from you, too, but I can't seem to request it via the interface.

@jianshu93
Copy link

Hello Felix,

I had a look at the evaluation and your benchmarking. I do not see anything wrong this time. My testings are nice considering output. I think the genome you are using are complete genomes right? prodigal actually use a different model for complete genomes. The reason why I use (-p meta) in prodigal is that my genome is not complete genome (there are gaps, not circular because they are assembly with short reads from metagenomes, not cultivated single genome). I think you would also want to try no meta to compare even though the complete model also works for contigs/assemblies (not circular) according to the original FGS description. Prodigal is a little bit different for complete and meta mode. See the paper here: https://academic.oup.com/bioinformatics/article/28/17/2223/246063?login=true

I think in the future FGSrs should also differentiate between complete genome and genome with gaps or assemblies.

Jianshu

@jianshu93
Copy link

Hello Felix, when all those commits are merged, please also create a condo recipe for FGSrs. A lot of softwares like the Metagenomics binning software MaxBin are now using FGS, which is very slow for large metagenomes. This could be a greate improvement

Jianshu

@jianshu93
Copy link

jianshu93 commented Sep 26, 2021

Hello Felix,

Another error I notice:

thread '' panicked at 'attempt to divide by zero', src/dna.rs:644:21
note: run with RUST_BACKTRACE=1 environment variable to display a backtrace

For some contig (less than 1000 bp, can reach 400 bp) run with compete mode there will always be this error. Any idea why?

Thanks,

Jianshu

@ninewise
Copy link
Author

That last commit should fix the 'divide by zero'-panick: I was dividing by the length of the sequence without checking whether it was empty. I've also put adding FGSrs to bioconda on the planning. It probably shouldn't be hard, since I can just follow the FGS recipe.

Could you perhaps suggest a good contig/assembly for running this test on? Many aren't properly annotated, and without annotations I can't make the same evaluation. (Unless I assume prodigal is always correct, and compare FGSrs to prodigal, not to the annotations.)

@ninewise ninewise merged commit 2757c4e into main Oct 6, 2021
@ninewise
Copy link
Author

ninewise commented Oct 6, 2021

I've merged this PR since it solves some major problems users may currently be encountering. I've created issue #7 to track Bioconda and issue #8 to track the additional benchmark data.

@jianshu93
Copy link

Hello Felix,

Sorry for such late response. I ran a test and this version is ok. I am wondering whether we can implement a fast AAI (average amino acid identity) module following FGSrs. which is to find best AA gene of one genome in another genome and also reverse. If the 2 are reciprocal best hit and then we keep this pair and do it for all genes in one genome and another genome and keep all reciprocal best hits and calculate average AAI based on all those genes. Seems search will take some time for each gene, diamond blastp could be better than we implement local alignment ourselves based on rust-bio.

do you have better idea if we do not rely on external database search software such as diamond to calculate AAI?

Many thanks,

Jianshu

@ninewise
Copy link
Author

I'm afraid I'm currently finishing my PhD, so I don't really have time to look into tangents at the moment. This does sound like an interesting idea though.

@jianshu93
Copy link

Hello Felix,

Good luck with you Ph.D! It's glad to hear that you are finishing it! I still have 3 years to go! Anyhow, I will implement it myself, possibly rely on blast+ suite to do best hit searching.

Jianshu

@jianshu93
Copy link

Hello Felix,

Can you also add a support for gzipped fasta file as input? Should be very easy to implement? Or is is already supported? Think about when we have huge number of genomes downloaded from NCBI, which are all in *.fna.gz format.

Thanks,

Jianshu

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

results not consistent with FGS+ and prodigal
3 participants